\chapter{Implementation}\label{chap:5}
In this chapter, the work of implementing malware classification system will be described. Starting with the implementing environment, this chapter will go through introduction of the meta-data used in
this system, and finally system software implementation which is written in PHP language.
%EXPERIMENTAL RESULTS AND ANALYSIS
%
%
\section{Environment}

In proposing system, operator system, Database, and are all operated in one computer. Technical information of employed computer is shown as in Table \ref{table:environment}. DecisionTree $1.5$ is good choice to this implementation, with these following reasons:
\begin{itemize}
\item The library constructs a decision tree from multidimensional training data.
\item The library use the decision tree thus constructed for classifying unlabeled data. \cite{decisiontreelibrary}.
\end{itemize} 
  \begin{table}
\begin{center}  
\label{table:environment}
    \begin{tabular}{ | l | l | l | }
     \hline
    Element &  Attribute  &  Environment \\ \hline
    OS & Computer & Ubuntu 11.04 \\ \hline
	Database & Mysql & Mysql 5.0 \\ \hline
	Programming language & C & gcc  \\ \hline
	 & PHP & PHP 5.0  \\ \hline
	 & Python & Python 2.7 \\ \hline
	 Library & DecisionTree 1.5 &  ID3 \\ \hline
    \end{tabular}
    \caption{Implementation environment of the malware classification system.}  
    \end{center}
  \end{table}
  
\section{Overview}
The system uses machine learning technique, named decision tree algorithm for malware classification. The goal is classifying malware fast and decision tree algorithm is used to achieve. Classification is a form of supervised learning, which requires training data, with known input/output, to form knowledge \cite{tonylee}. As mentioned in Figure \ref{fig:system_architec}, the system contains three following parts.

\begin{itemize}
\item First part: Read binary file to take meta data and input to database.
\item Second part: Manually cluster malware loaded from database.
\item Third part: Use decision tree algorithm to classify malware.
\end{itemize}
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]
{graph/system_architec.jpg}
\caption{The system architecture.}
\label{fig:system_architec}
\end{figure}

A vast amount of malware need analyzing, and the system uses database to store malware file's meta-data, to easily control a large number of data and to share them in the internet among web browsers.
The flow of data is shown in the Figure \ref{fig:system_architec}. At first, binary file meta-data are read, and all of the meta-data are input into database. In the second part, the system exports meta-data in database as the training data to create the decision tree for malware classifying. Finally, the decision tree algorithm created in the second step is utilized to class  unknown malware into the different malware families. 
%CLASSFICATION BASE ON DECISION TREE
%
%
\section{Classification based on machine learning technique} 
\subsection{Meta-data}

Meta-data is simply data about data. The malware file's meta-datas are all simply information about malware file. PE header includes: meta-data of MS-DOS section; COFF file header; Optional header; Section header; and other types of data in sections such as API import and export table. Meta-datas contain many data about malware. The malware classification system based on can identify malware changed on signature and syntax by using code reordering, code rewriting, and instruction insertion. Additionally, malware's meta-data can be easily received. Therefore, in order to create a fast and high accuracy malware classification system, PE header's meta-data is good choice to classification malware system. 

Malware file's meta-data are as shown in Table \ref{table:metadata}:
\begin{table}
\begin{center}
    \begin{tabular}{ | l | l |}
     \hline
    Meta-data classes & fields \\
  PE header & Magic, MajorLinkerVersion, SizeOfCode,\\
  & SizeOfInitializedData, SizeOfUninitializedData,  \\
 & AddressOfEntryPoint, BaseOfCode, BaseOfData, ImageBase,... \\ 
 \hline
	Dll &   wininet1dll, ole32dll, user321dll, gdi32dll,\\ 
	 & advapi321dll, crtdlldll, ntdlldll, shell32dll,...\\ 
	 \hline
	Session &  UPX0, UPX1, UPX2, text, data, rdata, rsrc, bss, \\
 & idata, rsrc1, aspack, packed, q5p6, \\
 & q1rk2j, 4895f, t623ai, weitl, edata, ndata,.. \\ 
 \hline
	API &  WritePrivateProfileStringA, \\
& GetPrivateProfileStringA, WriteFile, ReadFile, MulDiv,\\
& FindNextFileA, FindFirstFileA, DeleteFileA, CopyFileA,...\\  
 \hline
    \end{tabular}
	\end{center}
     \caption{The data table.}
    \label{table:metadata}
\end{table}

At first, there is a problem that PE file's meta data value is too large to detect unknown malwares which have meta-data's differences from the training data of the system. The meta-data of PE header file are separated  to give semantic information for malware classification. For example, normally \emph{ImageBase} field of PE header file have value 400000h \cite{goppit} so that the vaue of ImageBase of malware file in our system is 0 if it has 40000h and it is 1 if it has other value, and in this research that value is called as semantic value. 

\subsection{Create training data}

In this system use decision tree technique. The attributes of decision tree described in previous subsection. In here, we mentions the method of identifying malware family name based on malware name provided by anti-virus vendors.

This research applies virustotal service to take the name of malware provided by the vendors and uses it to cluster malware into malware families with semantic information.

Figure \ref{fig:clustering} indicates HTTP post technique to automatically get name from Virus total service and cluster malware into malware families which given in chapter \ref{chap:2}.

Malware families in this classification system are as follows: Win32/Virut, Win32/Autorun, Win32/IRCbot, Win32/Gaobot, Win32/Waledac, Win32/Downadup, Win32/Sality, and W32.Mota. 

The disadvantage of using Virustotal to identify malware's families is that a unique malware has many different names provided by many antis-virus vendors. Therefore, if unique malware has two or more families name provided by anti-virus vendor, unique malware will be classed into the most reported malware family.
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{graph/clustering.jpg}
\caption{Clustering method.}
\label{fig:clustering}
\end{figure}

\subsection{Classification}

To make malware classification rapid and correct, decision tree algorithm is utilized. Thus, it is easy to update a new malware family, the classification uses decision tree algorithm to determine each malware family. 
Decision tree 1.5 library is good choice for implementing malware classification system.  

Here is sample code:
\begin{lstlisting}[caption=Python Implementation of Decision tree 1.5, label=meinLabel]
  dt = DecisionTree( training_datafile = "training.dat", debug1 = 1 )
   dt.get_training_data()
   dt.show_training_data()
   root_node = 		    dt.construct_decision_tree_classifier()
   root_node.display_decision_tree("   ")
\end{lstlisting}

 
Malware' meta-data is used as input data for each decision tree, and the decision tree determines the malware family. Lists of malware PE header file's meta-data are Magic, MajorLinkerVersion, MinorLinkerVersion, SizeOfCode, SizeOfInitializedData, SizeOfUninitializedData, AddressOfEntryPoint, BaseOfCode, BaseOfData, ImageBase, SectionAlignment, FileAlignment, MajorOperatingSystemVersion, MinorOperatingSystemVersion, MajorImageVersion, MinorImageVersion, MajorSubsystemVersion, MinorSubsystemVersion, Reserved1, SizeOfImage, SizeOfHeaders, CheckSum, Subsystem, DllCharacteristics, SizeOfStackReserve, SizeOfStackCommit, SizeOfHeapReserve, SizeOfHeapCommit, LoaderFlags, NumberOfRvaAndSizes, Machine, NumberOfSections, TimeDateStamp, PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader, Characteristics, and other fields. The value of each field in PE header has semantics for decision tree algorithm if it is appeared at 10\% in malware training data. With this approach, the semantic value list of meta-data malware is made. With this approach, the list semantic value of meta-data of malware is generated. For example, this is list meta-data created:1, 2, 6, 4, 0, 1, 1, 4, 0, 200, 4, 0, 0, 0, 4, 0, 0, 1.00E+00, 200, 0, 2, 0, 100000, 4000, 100000, 1000, 0, 10, 14c, 2, 1000, 8, 0, e0, 3. This is value before generating :35328, 10b, 2, 0, 6200, 0, 200, b32e, b000, 9000, 400000, 1000, 200, 3, 0, 0, 0, 4, 0, 0, d000, 400, afe8, 2, 0, 1000, 1000, 10000, 0, 0, 10, 14c, 6, 0, 0, 0, e0, 10e.
 
The decision tree exported by system is as the following code. Probabilities of each node is the probabilities of each families. The most probable family is the family of malware sample.

\begin{lstlisting}{Decision tree exported by system }{Decision tree exported by system}
NODE 0:
   Branch features and values to this node: []
   Class probabilities at current node: [0.28916876574307304, 0.03853904282115869, 0.020151133501259445, 0.0619647355163728, 0.01057934508816121, 0.5531486146095718, 0.011083123425692695, 0.015365239294710327]
   Entropy at current node: 1.76730915987
   Best feature test at current node: 209
NODE 1:
   Branch features and values to this node: ['209=>0']
   Class probabilities at current node: [0.38588640275387265, 0.04819277108433735, 0.0006884681583476765, 0.08296041308089501, 0.013769363166953529, 0.45197934595524963, 0.00963855421686747, 0.006884681583476764]
   Entropy at current node: 1.56845724671
   Best feature test at current node: 44
NODE 2:
   Branch features and values to this node: ['209=>0', '44=>0']
   Class probabilities at current node: [0.4968425922724381, 0.054748822372797955, 0.0005797736588481511, 0.11087157799969649, 0.013693701656603949, 0.31032084114450614, 0.00944503996959897, 0.0034976509255101565]
   Entropy at current node: 1.1010285852
   Best feature test at current node: 32
NODE 3:
   Branch features and values to this node: ['209=>0', '44=>0', '32=>0']
   Class probabilities at current node: [0.5693425835153504, 0.06312750788349168, 0.0006773557567694956, 0.12900589385502, 0.014474831409967637, 0.21330422996733156, 0.00852684536125038, 0.0015407522508188685]
   Entropy at current node: 0.793704327232
   Best feature test at current node: 208
NODE 4:
   Branch features and values to this node: ['209=>0', '44=>0', '32=>1']
   Class probabilities at current node: [0.06609160326169744, 0.004967753320681031, 0.0, 0.0031284685758396417, 0.009052702284896878, 0.8867347206306493, 0.014900395603026747, 0.015124356323208915]
   Entropy at current node: 0.793704327232
   Best feature test at current node: 12
.............
.............
\end{lstlisting}
The virtualization decision tree is shown in Figure    \ref{fig:classificationsystem} . Each interior corresponds to one of the meta-data of new malware.
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{graph/classificationsystem.png}
\caption{The virtualization decision tree.}
\label{fig:classificationsystem}
\end{figure}